THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 


VOLUME 24, NUMBER 6 NOVEMBER, 1952 


On the Process of Speech Perception* 


J. C. R. Licklider
Acoustics and Lincoln Laboratories, Massachusetts Institute of Technology, Cambridge, Massachusetts 


(Received August 10, 1952) 


The process of speech perception is analyzed into three main operations: (1) translation of the speech 
signal into form suitable for the nervous system, (2) identification of discrete speech elements, and (3) com- 
prehension of meaning. The first operation appears to correspond roughly to the transformation made by 
the sound spectrograph. The second may be carried out by the neural equivalent of a set of matched filters. 
The third appears to involve a neural form of cross-correlation that exhibits some of the properties of the
analogous electronic process.


INTRODUCTION 


THE perception of speech appears to involve three
kinds of operations. First, the speech wave is 
translated into a form suitable for the nervous system 
to handle. Second, the stream of speech is segmented 
into elements, and the elements are recognized or identified.
And, third, the meaning or significance of the
message is comprehended. Operations of the first and 
third kinds are obviously necessary parts of the over-all 
process, but their necessity does not dictate very clearly 
their natures or specifications. An operation of the 
second kind is probably not logically necessary, but 
there is so much evidence that speech is basically a 
sequence of discrete elements that it seems reasonable 
to limit consideration to mechanisms that break the 
stream of speech down into elements and identify each 
element as a member, or as probably a member, of one 
or another of a finite number of sets. 


NATURE OF THE SPEECH SIGNAL 


The acoustic speech wave is of course specified 
completely by the sound pressure wave form, the func- 
tion giving the sound pressure at each instant of time. 
The pattern of speech is more readily seen, however, in 
other displays of the signal. The most familiar of these, 
the sound spectrogram of the Bell Telephone Laboratories,¹,²
presents the signal in what appear from the
auditory point of view to be the natural dimensions, 
intensity, frequency, and time. In these dimensions, 
speech may be specified with reasonable completeness 
by giving the center frequencies, the band widths, and 
the strengths of about three formants, describing the 
momentary spectrum as line, continuous, or mixed, and 
(if there is one) giving the fundamental frequency of 
the periodic part. 
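The specification just listed can be written down as a small record, which makes it easy to count how many numbers it takes to describe one moment of speech. The following is only a sketch: the class names, field names, units, and sample values are assumptions introduced for illustration and do not come from the spectrograph literature.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class SpectrumKind(Enum):
    LINE = "line"              # periodic excitation, harmonic lines
    CONTINUOUS = "continuous"  # noise-like excitation
    MIXED = "mixed"


@dataclass(frozen=True)
class Formant:
    center_hz: float
    bandwidth_hz: float
    strength_db: float


@dataclass(frozen=True)
class SpeechFrame:
    formants: Tuple[Formant, Formant, Formant]
    spectrum: SpectrumKind
    fundamental_hz: Optional[float]  # None when there is no periodic part

    def parameter_count(self) -> int:
        # Three numbers per formant, one spectrum label, and the
        # fundamental frequency when a periodic part is present.
        return 3 * len(self.formants) + 1 + int(self.fundamental_hz is not None)


# Illustrative values only; they are not measurements from the paper.
frame = SpeechFrame(
    formants=(
        Formant(700.0, 90.0, 0.0),
        Formant(1200.0, 110.0, -6.0),
        Formant(2500.0, 170.0, -20.0),
    ),
    spectrum=SpectrumKind.LINE,
    fundamental_hz=120.0,
)
print(frame.parameter_count())  # prints 11
```

Under these assumptions a voiced frame takes eleven parameters, each of which would vary as a function of time from one speech sound to the next.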

* The preparation of this paper was supported in part by a
contract between the Air Force Cambridge Research Center and
M.I.T.

1 R. K. Potter, Science 102, 463 (1945).

2 Potter, Kopp, and Green, Visible Speech (D. Van Nostrand
Company, New York, 1947).

Two other ways of specifying the speech signal are
of comparable interest. They are based on analyses
of the process of producing speech but give (or probably
can be extended to give) a specification in acoustical
terms. In one mode of specification, developed principally
by Fant³ and K. N. Stevens,⁴ the speech wave
is considered to be the result of applying an excitation 
function to a system of cavities. The signal is specified 
by giving the main facts (voiced or fricative, funda- 
mental frequency, locus of application) about the excita- 
tion function and the main facts (dimensions, damping) 
about the cavities. In this system, as in the intensity- 
frequency-time pattern, we work with about a dozen 
parameters. As we go from one speech sound to the 
next, they vary, most of them continuously but one 
or two in on-off fashion, as functions of time. 
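The idea of applying an excitation function to a system of cavities can be illustrated with a very small digital sketch. It assumes, purely for illustration, that each cavity resonance is approximated by a two-pole digital resonator and that the resonators act in cascade; that choice, the function names, and the numerical values are not taken from Fant or Stevens.

```python
import math


def resonator(signal, center_hz, bandwidth_hz, sample_rate_hz):
    """Two-pole resonator standing in for one cavity resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)  # damping per sample
    theta = 2.0 * math.pi * center_hz / sample_rate_hz      # resonant angle per sample
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def pulse_train(n_samples, fundamental_hz, sample_rate_hz):
    """Voiced excitation: one unit pulse per fundamental period."""
    period = round(sample_rate_hz / fundamental_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def synthesize(excitation, resonances, sample_rate_hz):
    """Pass the excitation through each (center, bandwidth) resonance in turn."""
    wave = excitation
    for center_hz, bandwidth_hz in resonances:
        wave = resonator(wave, center_hz, bandwidth_hz, sample_rate_hz)
    return wave


FS = 8000  # assumed sampling rate in samples per second
excitation = pulse_train(400, fundamental_hz=100.0, sample_rate_hz=FS)
wave = synthesize(excitation, [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)], FS)
```

Here the excitation carries the on-off and fundamental-frequency information, and the list of resonances carries the cavity information; changing either list from one sound to the next corresponds to the parameters varying as functions of time.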

The other alternative to the spectrogram is the 
specification in terms of binary distinctions described 
by Jakobson and his co-workers.⁵ It is not yet entirely
settled what the acoustic counterparts of some of the 
phonatory distinctions are, but it is nevertheless of in- 
terest to picture speech as the product of an encoding 
that is governed by a set of decisions between paired 
alternatives. One of the decisions determines whether 
or not the signal is voiced, another determines whether 
or not it is fricative, a third determines whether or not
its spectrum is compact, etc. The encoding may be 
relative rather than absolute: one can think of a distinc- 
tion as being encoded into the change or continuation 
of a part of the acoustic pattern. 
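The difference between absolute and relative encoding of binary distinctions can be made concrete with a short sketch. The three feature names are the ones mentioned above; the feature assignments for the three sounds, the all-off starting state, and the function names are hypothetical.

```python
FEATURES = ("voiced", "fricative", "compact")  # the three decisions named in the text


def encode_absolute(decisions):
    """Pack the yes/no decisions for one sound into a tuple of bits."""
    return tuple(int(decisions[name]) for name in FEATURES)


def encode_relative(codes):
    """Mark each feature as changed (1) or continued (0) from the previous sound."""
    previous = (0,) * len(FEATURES)  # assumed starting state: every feature off
    out = []
    for code in codes:
        out.append(tuple(a ^ b for a, b in zip(code, previous)))
        previous = code
    return out


def decode_relative(changes):
    """Recover the absolute codes by accumulating the changes."""
    state = (0,) * len(FEATURES)
    out = []
    for change in changes:
        state = tuple(s ^ c for s, c in zip(state, change))
        out.append(state)
    return out


# Hypothetical feature assignments, for illustration only.
sounds = [
    {"voiced": True, "fricative": False, "compact": True},
    {"voiced": True, "fricative": True, "compact": True},
    {"voiced": False, "fricative": True, "compact": False},
]
absolute = [encode_absolute(s) for s in sounds]
relative = encode_relative(absolute)
```

Both encodings carry the same information, since the absolute sequence can be recovered from the relative one; they differ only in whether a decision is signalled by a state or by a change of state.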

In the present context, the significant thing about the 
ways, just mentioned, of specifying speech is that the 
speech wave form must be translated into another form 
before it makes much sense, either through vision or by 
way of verbal description, to a human being. It is not 
unreasonable to suppose that the transformations em- 
ployed by the auditory system bear some resemblance 
to those that help to make the speech signal compre- 
hensible via indirect presentation. 


3 C. G. M. Fant, Tech. Rept. 12 (Acoustics Laboratory,
Massachusetts Institute of Technology, January, 1950).

4 K. N. Stevens, Quart. Prog. Rept. (Acoustics Laboratory,
Massachusetts Institute of Technology, January-March, 1952),
pp. 3-4.

5 Jakobson, Fant, and Halle, Tech. Rept. 13 (Acoustics
Laboratory, Massachusetts Institute of Technology, January,


MECHANICAL FREQUENCY ANALYSIS


The first transformation is mechanical or, to be more
precise, hydrodynamical. The experimental findings of
Békésy⁶ and the theoretical work of Ranke,⁷ Zwislocki,⁸
Peterson,⁹ Bogert,¹⁰ and Fletcher¹¹ leave little doubt
that the cochlea acts approximately as a transmission 
line with input terminals (stapes) at one end and taps 
(terminations of neurons of the auditory nerve) distrib- 
uted along its length. Variation of impedance with 
distance from the input end accounts for the fact that 
the transfer characteristic associated with each output 
tap is different from those associated with its neighbors. 
The frequency of maximal response shifts progressively 
down the scale as we go from the base to the apex of the 
cochlea. 

The mechanical action of the cochlea thus corre- 
sponds roughly to the action of the sound spectrograph. 
It makes the transformation from a single, rapidly 
changing wave form to a manifold of slowly varying 
functions distributed in space, and thereby matches 
the speech signal to the capabilities of the nervous 
system. However, if we subsume under “mechanical 
action” only those steps of the cochlear process that 
precede the excitation of the neurons, we find three 
fundamental differences between the cochlear process 
and that of the sound spectrograph. These differences 
suggest that we should make the next stage of the 
auditory process part of the picture before we conclude 
that the first big step in speech perception is a trans- 
formation from the wave form to a representation 
equivalent to an intensity-frequency-time plot. 

(1) The pass bands of the cochlear analyzer are con- 
siderably broader than the corresponding pass bands 
of the spectrograph, even when the spectrograph is 
adjusted for broad response. This poses the problem 
of sharpening mechanisms in the auditory system. If 
the mechanical action of the cochlea is less selective than 
the over-all system appears to be, then it is likely that 
later stages of the auditory process sharpen the me- 
chanical analysis. 

(2) In passing from the input terminals to one of the 
output taps of the cochlea, the signal passes through, 
and is acted upon by, all the segments of the analyzing 
mechanism that lie between the input and that output. 
In passing from the input terminals to one of the 
outputs (i.e., to one of the frequency lines in the display) 
of the sound spectrograph, on the other hand, the signal 
passes through, and is acted upon by, only a single 
band-pass filter. This difference focuses attention upon 
the possibilities of phase-sensitive mechanisms of the 
kind described by Huggins:¹² phase shifts accumulate
as the signal passes along the cochlea and provide large 
differences for a phase-sensitive mechanism to work on,
whereas, in a parallel-resonator analyzer like the spectrograph,
the total phase shift is only that introduced by
a single resonator.

(3) The mechanical analysis performed by the cochlea
includes only the linear frequency-selective transformations;
it does not include the envelope-detecting operation
that, in the spectrograph, follows the linear transformation.
The signal delivered to the auditory nerve
therefore has a fine structure in the time domain, and
some of its detail is preserved in the neural discharges.
This suggests that the next stage of the auditory process
may be designed to take advantage of the time detail,
for we cannot expect it to be preserved through many
synapses.

6 G. v. Békésy and W. A. Rosenblith, Handbook of Experimental
Psychology, S. S. Stevens, editor (John Wiley and Sons, Inc., New
York, 1951), pp. 1075-1115.

7 O. F. Ranke, Z. Biol. 103, 409 (1950).

8 J. Zwislocki, J. Acoust. Soc. Am. 22, 778 (1950).

9 L. C. Peterson and B. P. Bogert, J. Acoust. Soc. Am. 22, 369
(1950).

10 B. P. Bogert, J. Acoust. Soc. Am. 23, 151 (1951).

11 H. Fletcher, J. Acoust. Soc. Am. 23, 637 (1951).

12 W. H. Huggins, J. Acoust. Soc. Am. 24, 582 (1952).


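The contrast drawn in point (2) between a tapped cascade and a bank of parallel resonators can be illustrated numerically. The sketch below assumes identical first-order low-pass sections in series, which is a deliberate oversimplification (in the cochlea the impedance varies with distance); the function names and the 1000-cps figures are likewise assumptions.

```python
import cmath
import math


def section_response(freq_hz, cutoff_hz):
    """Complex frequency response of one first-order low-pass section."""
    return 1.0 / (1.0 + 1j * freq_hz / cutoff_hz)


def cascade_phase_deg(freq_hz, cutoff_hz, n_sections):
    """Total phase shift at the tap that follows n_sections sections in series."""
    # Phases add along a cascade, so the per-section phase is multiplied by the
    # number of sections rather than read from the product of the responses,
    # which would wrap around at +/-180 degrees.
    per_section = cmath.phase(section_response(freq_hz, cutoff_hz))
    return math.degrees(n_sections * per_section)


# A parallel analyzer gives every output the phase shift of a single section.
single = cascade_phase_deg(1000.0, 1000.0, 1)

# A cascade gives later taps progressively larger accumulated shifts.
tapped = [cascade_phase_deg(1000.0, 1000.0, n) for n in (1, 5, 10, 20)]
```

With these assumed sections a single filter shifts the phase by 45 degrees at its cutoff frequency, whereas the twentieth tap of the cascade has accumulated twenty times that amount, which is the kind of large difference a phase-sensitive mechanism could exploit.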
SHARPENING 


It appears probable that the next stage is devoted to 
sharpening the analysis in frequency that was started 
by the mechanical action of the cochlea. Huggins¹² has
just described one way in which sharpening may be
achieved, and other ways have been outlined.¹³,¹⁴ It is
possible that both the excitation process through which 
the mechanical oscillations influence the neurons of the 
auditory nerve and the neural mechanisms of the lower 
centers of the auditory system (cochlear nucleus, etc.) 
serve to sharpen the mechanical analysis. 

Since hypothetical mechanisms have been discussed 
elsewhere, we need not consider the sharpening process 
in detail. We should, however, take stock of the situation
that we suppose exists at the end of the sharpening
process. First, we have a sharpened, running frequency
analysis of the acoustic signal. This corresponds to a 
sound spectrogram: intensity is the density of discharge 
in a small region of the cross section of the ascending 
auditory pathway; frequency is place, or the locus of 
the small region in the cross section; and time is time. 
(However, a restriction has been placed on the rate at 
which intensity can vary as a function of time, for the 
density of discharge is determined by integrating over
short intervals of time.) Second, we probably have, in 
another part or dimension of the ascending pathway, a 
representation of the fundamental frequencies of 
periodic components of the acoustic signal. This repre- 
sentation may possibly be preserved in the time domain, 
but it seems more likely that it is encoded into place. 
And, third, we may possibly have a fairly precise repre- 
sentation, in time, of transients in the acoustic signal. 
This is doubtless the case if the fundamental frequencies 
of periodic components are preserved in the form of 
periodic discharges in the higher auditory pathways. 
There is some reason to suppose that a broad-band 
pathway by-passes the sharpening mechanisms and 
carries as much temporal detail as the nervous system 
can retain all the way to the cerebral cortex. Those
three pieces of information constitute the input to the
part of the auditory mechanism concerned with the
recognition of the discrete elements that correspond to
phonemes, syllables, words, and phrases.

13 W. H. Huggins and J. C. R. Licklider, J. Acoust. Soc. Am.
23, 290 (1951).

14 J. C. R. Licklider, Experientia 7, 128 (1951).


IDENTIFICATION OF ELEMENTS 


From an engineering point of view, two classes of 
operations are of particular interest whenever the 
problem is to identify a pattern as corresponding to one 
of a set of standard patterns. One of the classes involves 
cross correlation, the other involves matched filtering. 
There may be a rough analogy between these two ways 
of identifying or recognizing and the two approaches 
that come naturally to a person faced with an everyday 
version of the problem. These approaches are comparing 
with a model and testing in a mold. The validity of a 
signature is determined by comparing; the example 
in question is correlated with a standard sample. 
Fingerprints are identified in the same way. But a 
footprint identifies a shoe, a lock specifies a key, and 
an all-but-completed jig-saw puzzle defines the missing 
piece. In general, we identify things by holding them 
up against standards or by trying to fit them into, or 
through, structures to which they are attuned, but 
which they do not directly resemble. Correlating appears 
to be related to the former procedure, filtering to the 
latter. The question is, is either correlation or filtering 
a reasonable model for the part of the auditory process 
that is concerned with identifying elements? 

The first step in examining that question is to consider 
the operations performed by correlators and filters. 
For the sake of simplicity, it is best to restrict the dis- 
cussion to single functions of single variables, but the 
basic notions are readily extended to the manifold 
functions of several variables with which the auditory 
system must operate.

The operation of a cross-correlator is specified by the 
cross-correlation function, which it determines: 


$$r(\tau) = \int f(t)\, g(t+\tau)\, dt = \int f(t-\tau)\, g(t)\, dt. \qquad (1)$$


This expression indicates that the correlator delays one 
of the two signals by a variable interval τ, multiplies
the delayed signal by the undelayed signal, and in- 
tegrates the product over time. In existing electronic 
correlators, the operations of delay, multiplication, and 
integration are performed by separate components, and 
the period of integration is, of course, finite. If the 
nature of the signals varies from time to time, it is 
convenient either to determine a series of short-interval 
correlation functions or to employ “running” integration 
of the sort made by a low-pass filter. If the auditory 
process involves correlation, it is of necessity either 
a series of short-interval correlations or a running 
correlation. 
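The three operations of the correlator (delay, multiplication, integration) can be written out directly for sampled signals. This is a minimal discrete sketch of the left-hand form of expression (1), with the integral replaced by a finite sum; the function names, the window length, and the test signals are assumptions made for illustration.

```python
def cross_correlation(f, g, max_lag):
    """r[lag] = sum over t of f[t] * g[t + lag], for lag in -max_lag..max_lag."""
    r = {}
    for lag in range(-max_lag, max_lag + 1):      # the variable delay
        total = 0.0
        for t in range(len(f)):
            if 0 <= t + lag < len(g):
                total += f[t] * g[t + lag]        # multiply, then integrate (sum)
        r[lag] = total
    return r


def short_interval_correlations(f, g, max_lag, window):
    """A series of correlation functions, one per non-overlapping window."""
    n = min(len(f), len(g))
    return [
        cross_correlation(f[start:start + window], g[start:start + window], max_lag)
        for start in range(0, n - window + 1, window)
    ]


f = [0, 0, 1, 2, 1, 0, 0, 0]
g = [0, 0, 0, 0, 1, 2, 1, 0]   # the same pulse as f, two samples later
r = cross_correlation(f, g, max_lag=3)
best_lag = max(r, key=r.get)   # lag at which the two signals agree best
```

In this example the correlation function peaks at a lag of two samples, the amount by which the pulse in g trails the pulse in f. The second function corresponds to the series of short-interval correlations mentioned above; a running integration would replace the windowed sum with a decaying average.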

The behavior of a linear filter is governed by its 
impulse response or weighting function, h. The response
of the filter to an arbitrary input wave is given by the
convolution of the input function with the weighting
function:


$$R(t) = \int f(t-\tau)\, h(\tau)\, d\tau. \qquad (2)$$


This expression says that the output of the filter at 
time t is the sum of the responses of the filter to instantaneous
elements of the input (i.e., to a succession of
impulses so close together that they constitute the
input wave), the response to each element being given
the weight appropriate to the interval τ between the
time at which the element occurred and the time t. The
operations specified by (2) are inherent in the action of 
networks comprised of linear parts. An actual electronic 
filter is therefore quite different in form from an actual 
correlator. 
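The verbal description of expression (2) translates almost word for word into a discrete sketch: each input sample is treated as an impulse, and the output is the sum of the correspondingly scaled and shifted impulse responses. The function name and the impulse response used here are hypothetical.

```python
def filter_response(f, h):
    """R[t] = sum over tau of f[t - tau] * h[tau], the discrete form of (2)."""
    out = [0.0] * (len(f) + len(h) - 1)
    for t0, x in enumerate(f):             # each input sample is one scaled impulse
        for tau, weight in enumerate(h):   # its response arrives tau samples later
            out[t0 + tau] += x * weight
    return out


h = [1.0, 0.5, 0.25]                       # hypothetical impulse response
impulse = filter_response([1.0], h)        # a single impulse reproduces h
two = filter_response([1.0, 0.0, 2.0], h)  # two impulses: responses superpose
print(two)  # prints [1.0, 0.5, 2.25, 1.0, 0.5]
```

No stored comparison signal and no explicit multiplier for two signals appear in this procedure; the weighting is carried entirely by the fixed sequence h, which is the sense in which the pattern is built into the structure of the filter.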

Despite the marked difference between the physical 
realizations of correlation and filtering, the two processes
are mathematically the same.¹⁵ The equivalence
may be shown by relating expression (1) to expression
(2). Starting with the right-hand part of (1), we substitute
t for τ, and τ for t. That yields


$$r(t) = \int f(\tau - t)\, g(\tau)\, d\tau. \qquad (3)$$


Since all the values of the product contribute to the
integral no matter in which order, forwards or backwards
in time, they are taken, we can reverse both the
t and the τ scales. Reversing them gives us


$$r(t) = \int f(t - \tau)\, g(-\tau)\, d\tau. \qquad (4)$$


Compare (4) with (2), and we see that the cross-correlation
function obtained by comparing an input function
f(t) with a standard function g(t) has the same shape as
the response of a filter to f(t) if the impulse response h
of the filter is the standard function g turned around 
backward on the time scale. Mathematically, therefore, 
there is no basis for choice between correlation and 
filtering: whatever can be said for one can be said for 
the other. 
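The equivalence can be checked numerically for sampled signals. The sketch below assumes finite sequences and a causal filter, so the time-reversed standard is delayed by its own length less one sample; the names and the test values are illustrative only.

```python
def correlate(f, g, lag):
    """r(lag) = sum over t of f[t] * g[t + lag], the left-hand form of (1)."""
    return sum(f[t] * g[t + lag] for t in range(len(f)) if 0 <= t + lag < len(g))


def convolve(f, h):
    """R[n] = sum over k of f[n - k] * h[k], the discrete form of (2)."""
    out = [0.0] * (len(f) + len(h) - 1)
    for i, x in enumerate(f):
        for k, w in enumerate(h):
            out[i + k] += x * w
    return out


f = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0]   # input function
g = [2.0, -1.0, 0.5]                  # standard function

h = g[::-1]                           # impulse response: g turned around in time
filtered = convolve(f, h)

# Output sample n of the filter corresponds to lag (len(g) - 1) - n of the
# correlation: the same values, read along a reversed and shifted time scale.
offset = len(g) - 1
matches = all(
    abs(value - correlate(f, g, offset - n)) < 1e-12
    for n, value in enumerate(filtered)
)
print(matches)  # prints True
```

The filter output and the correlation function contain the same values; the only difference is the direction and origin of the scale along which they are read, which is what reversing the t and τ scales accomplishes in passing from (3) to (4).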

When we turn to examine the properties of nervous 
tissue in relation to the operations of correlating and 
filtering, we are likely to be struck by the match between 
the requirements of the correlator and the capabilities 
of neurons. The operations of the correlator are to delay 
a signal, to multiply one signal by another, and to deter- 
mine a running integral or a short-time integral. A chain 
of neurons is inherently a delay line. The spatial summa- 
tion associated with a region of synapses provides an 
approximation to multiplication. And the temporal 
summation associated with a region of synapses provides 
an approximation to integration. In fact, even the
quantitative aspects of the requirements and the
capabilities are reasonably well in agreement. Delays
and integration times of 0.1 second, which is the order
of magnitude of a phonemic interval, are not too much
to ask for in the nervous system. Correlation therefore
appears to provide a promising model.

15 W. H. Huggins, personal communication (1949).

A hypothetical neuronal correlator suitable for identi- 
fying elementary speech sounds has about 50 channels, 
each of them including spatially distributed multipliers 
and integrators capable of handling the neural activity 
described at the end of the section on “Sharpening.” 
Each channel contains a local generator of patterns 
corresponding to a particular sound, in addition to its 
multiplier and integrator. The incoming signal is
applied to all the channels simultaneously and, if the 
delay line is employed, as it is in the general model, also 
under a number of delay intervals. The space-time 
integrals determined by the correlator are led to a 
comparator, which indicates the integral that has the 
greatest magnitude. That integral identifies the in- 
coming signal as a member of the same class as the 
standard pattern with which the integral is associated. 
A network of the type just described would of course 
require many neurons, but there are an estimated 10¹⁰
neurons in the nervous system; the operations described 
are natural operations, not operations requiring special- 
ized neural structures; and the verbal mechanism is one 
of the most important mechanisms in the brain and 
has a reasonable call for a considerable share of the 
facilities. 

It is of course not feasible to represent diagrammatically
all the mechanism just outlined. However, if the delay 
line and the spatial dimension(s) are eliminated and the 
number of channels is reduced to three, schematic 
representation comes within range. Figure 1 shows an 
arrangement for correlating an incoming signal f(t) 
against standards a(t), b(t), and c(t). As it happens, f is
more nearly like b than either of the other standards,
and the correlator-comparator identifies f as a member
of the “b” class.
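The reduced three-channel arrangement can be sketched in a few lines. The standard patterns below are hypothetical and were chosen to have equal energy, so that the raw integrals can be compared directly; with unequal standards some normalization would be needed before the comparison. The delay line is omitted, as in the simplified scheme.

```python
def correlate_zero_lag(f, standard):
    """Multiply point by point and integrate (sum); no delay line is used."""
    return sum(x * s for x, s in zip(f, standard))


def identify(f, standards):
    """Comparator: return the label of the channel with the largest integral."""
    integrals = {label: correlate_zero_lag(f, s) for label, s in standards.items()}
    return max(integrals, key=integrals.get), integrals


# Hypothetical standard patterns of equal energy; not taken from the paper.
standards = {
    "a": [1, 1, -1, -1],
    "b": [1, -1, 1, -1],
    "c": [1, -1, -1, 1],
}
f = [0.9, -1.1, 0.8, -1.0]   # a perturbed version of standard "b"
label, integrals = identify(f, standards)
print(label)  # prints b
```

The comparator does not require an exact match; it reports only which channel produced the largest integral, so a perturbed input is still assigned to the class it most nearly resembles.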

The hypothetical neuronal correlator is thus a close 
analog of an electronic correlator. The kind of 
filtering that might be performed by the nervous 
system, on the other hand, has little in common with 
the filters used in communication circuits, for there are 
no linear elements in the brain. A neural matched filter 
would have to employ the same basic operations as a 
correlator, and the distinction between the two ap- 
proaches is therefore, insofar as the nervous system is 
concerned, somewhat blurred. Nevertheless, there is 
one aspect of the distinction that remains, and it may 
be important. The correlator employs stored standard 
patterns; patterns are withdrawn from storage and 
compared against the incoming signal. The withdrawal 
from storage or, what amounts to the same thing, the 
local generation, of standard patterns is an active 
process. The filter, on the other hand, has its patterns 
built into its own structure. The identification of an 
incoming pattern is therefore made on the basis of 
which filter, in a set of filters, the signal passes through
most successfully. This need not be an active process
in the same sense as the working of the correlator. For
example, the filter may rest in a quiescent state, waiting
for a signal to appear. It is ready to function as long as
its neurons are ready to conduct. Unless some arrangement
is made to start the correlator at the instant a
sound occurs, on the other hand, the correlator must
continue even during intervals of silence to generate
standard signals so that a comparison will be possible
when speech sounds enter the multipliers.

FIG. 1. Schematic diagram of simplified identification process
based on correlation. The function f(t) to be identified is shown at
(1). Standard functions a(t), b(t), and c(t) are generated by the
boxes at (2). The incoming and standard functions are correlated
[multiplied at (3) and integrated at (4)]. The integrals are then
led to the comparator (5), which indicates (6) the channel with
the largest correlation. That identifies the incoming function as
belonging to the class of the standard of that channel.

Whether or not the ever-readiness of the filter con- 
stitutes an important advantage depends to a consider- 
able extent upon how the process of segmentation is 
handled by the auditory system. There appear to be 
two general ways in which it might be handled. (1) The 
stream of speech could be subdivided into discrete in- 
tervals by a process that preceded the identification of 
the elements thus formed. Or (2) the stream could be 
introduced in a continuous manner into the identifica- 
tion mechanism, the mode of operation of that me- 
chanism being to examine the stream continuously and 
to post an identification whenever the criterion of cor- 
respondence was reached. In the first alternative, 
segmentation is a prerequisite for identification; in the 
second, it is a by-product. 
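The second alternative, in which segmentation falls out of identification, can be sketched as a mechanism that examines the stream continuously and posts an identification whenever a criterion is reached. The standards, the stream, the criterion value, and the assumption that all standards have the same length are hypothetical.

```python
def running_identification(stream, standards, criterion):
    """Examine the stream sample by sample; post (time, label) at each match."""
    posted = []
    width = len(next(iter(standards.values())))   # assumes equal-length standards
    for end in range(width, len(stream) + 1):
        window = stream[end - width:end]          # most recent stretch of the stream
        scores = {
            label: sum(x * s for x, s in zip(window, std))
            for label, std in standards.items()
        }
        label = max(scores, key=scores.get)
        if scores[label] >= criterion:            # criterion of correspondence reached
            posted.append((end, label))
    return posted


standards = {"a": [1, 1, -1, -1], "b": [1, -1, 1, -1]}
stream = [0, 0, 1, 1, -1, -1, 0, 0, 0, 1, -1, 1, -1, 0]
events = running_identification(stream, standards, criterion=4)
print(events)  # prints [(6, 'a'), (13, 'b')]
```

No boundaries are supplied to this procedure in advance. The times at which identifications are posted mark the ends of the elements, so the segmentation emerges from the identification itself.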

If the auditory system uses the first method of 
segmenting the stream of speech, the delay line of the 
correlator could be eliminated. (This method is so much 
more economical of facilities than the second method is 
that it seems more reasonable on a priori grounds.) 
And if the delay line is eliminated, filtering has rather 
little advantage of simplicity over correlation as a 
model for the process of identification. On the other 
hand, if the flow of speech is too irregular and too un- 
predictable for the sensitive timing it would take to 
permit elimination of the delay line, then the filter 
model has a definite advantage over the correlation 
model. 

In making a tentative choice between the two models, 
two considerations bear some weight. First, the mechanism
of speech perception is one of the few parts of
the cerebral apparatus that deal with so few basic 
elements and that take such a long time to develop. 
There may be only 50 or 100 basic elements at the level 
of the first identification process. And it takes some- 
thing like 10 years to perfect the mechanism. It is not 
unreasonable to suppose, therefore, that the selective 
mechanism of speech perception is built into the neural 
tissue in the same sense that the selective mechanism 
of an electronic filter is built into the structure of its 
inductors and capacitors. Second, preliminary experi- 
ments with mistimed speech indicate that perception 
is upset very little, if at all, when a magnetic tape con- 
taining recorded speech is spliced in such a way as to 
advance or delay the beginnings and ends of speech 
sounds. This is true even if the intensity pattern is 
modified by extreme peak clipping. The two considera- 
tions just mentioned, taken together with the feeling 
that the identification of speech elements is a most 
automatic and passive process, constitute an argument, 
though a weak one, that the identification process 
bears more relation to filtering than to correlation. 


COMPREHENSION OF MEANING 


There is some reason for thinking that comprehension 
is more a matter of correlation than of filtering. One of 
the main principles of pedagogy is that understanding is 
an active process, that the student must participate if he 
is to comprehend. It would appear that the active 
participation that is required is the generation of 
internal signals against which to correlate the incoming 
messages. 

From the point of view of engineering design, the 
question is one of bringing incoming messages into rela- 
tion with stored patterns. In the case of the filter, as we 
have been conceiving it, the pattern is stored in the 
structure of the filter. It is therefore necessary to lead 
the incoming pattern to the filter if the two are to 
interact. Since the incoming pattern must be tried out 
on many filters, it must be led to many places. In the 
case of the correlator, on the other hand, one can think 
of the correlating mechanism as being indifferent to the 
particular patterns upon which it operates. Patterns 
may be stored at any place in the nervous system, and 
still be directed to the correlating mechanism for com- 
parison against messages coming in through sensory 
channels or arising in other parts of the brain. This 
provides a flexibility that is not inherent in the filter 
model. 

Thinking of the mechanism of comprehension in this 
way makes it clear why, to be meaningful, messages 
must be redundant, why messages encoded for human 
reception are repetitious and coherent. The reason is 
that patterns must be called up from storage for correla- 
tion against the incoming pattern. The patterns called 
from storage must be selected, for there would be no 
point in determining correlations with patterns drawn 
at random. It is therefore very important that the 
incoming messages arrive in such an order that appropriate
comparison patterns can be activated ahead of
time. Evidently, comparison patterns are stored in an 
orderly way, with related patterns in functional associa- 
tion. When a stored pattern turns out to have a strong 
correlation with an incoming pattern, that confirmation 
causes activation of other patterns closely bonded to 
the successful one. When a correlation turns out to be 
low, patterns close to the unsuccessful one are suppressed.
According to this schema, comprehension is a
game of “hot and cold,” the clues being given by the 
degrees of correlation between the incoming messages 
and the patterns drawn from storage. The result of the 
game is not so much to add new patterns to the storage 
as to restructure the store by creating new bonds and 
strengthening or weakening old ones among the stored 
patterns. 
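The game of "hot and cold" can be given a toy form. In the sketch below every stored pattern has a readiness between 0 and 1, and the outcome of each correlation raises or lowers the readiness of the patterns bonded to the one just tested. The pattern names, the bonds, the threshold, and the step size are all invented for illustration; nothing here is proposed as the actual neural rule.

```python
def update_activation(activation, bonds, tested, correlation,
                      threshold=0.5, step=0.2):
    """Raise or lower the readiness of patterns bonded to the one just tested."""
    delta = step if correlation >= threshold else -step   # "hot" or "cold"
    updated = dict(activation)
    for neighbor in bonds.get(tested, ()):
        updated[neighbor] = min(1.0, max(0.0, updated[neighbor] + delta))
    return updated


# Hypothetical store of patterns and the bonds among them.
bonds = {"weather": ["rain", "sun"], "rain": ["umbrella"], "music": ["piano"]}
names = ("weather", "rain", "sun", "umbrella", "music", "piano")
activation = {name: 0.5 for name in names}

# A strong correlation with "weather" readies its neighbors ...
activation = update_activation(activation, bonds, "weather", correlation=0.9)
# ... and a weak correlation with "music" suppresses its neighbors.
activation = update_activation(activation, bonds, "music", correlation=0.1)
```

A message whose parts arrive in a coherent order keeps the appropriate patterns ready ahead of time, whereas a message without redundancy gives the mechanism nothing to steer by.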


CAVITY DIMENSIONS AND BINARY DISTINCTIONS 


In the discussion of identification, we assumed that 
the spectrogram, or some abstraction from it that 
reduces variability among utterances of sounds of the 
same class, is the basic datum upon which the identifica- 
tion process operates. We should ask how the specifica- 
tions of speech in terms of cavity dimensions and in 
terms of binary distinctions fit into the hypothetical 
picture. 

The parts of the verbal mechanism that control the 
production of speech and the perception of speech 
develop in close interrelation. Obviously, the nervous 
system contains within it the transformations that take 
the sensory patterns of the various sounds over into the 
motor patterns: echoic behavior is primitive, universal, 
and automatic. Whatever the nature of the sensory 
pattern, therefore, it is clear that a related pattern that 
contains an encoding of the phonation of the speech
sound (i.e., a pattern equivalent to the Fant-Stevens
specification) is immediately within reach. It is possible
that the process of identification operates on the motor 
form of the signal rather than upon the sensory form. 
Or, more probably, the process involves both the sensory 
and the motor patterns. In either event, the specification 
in terms of cavity and excitation parameters would be 
a “natural” specification of speech. 

Binary distinctions could be extracted, of course, 
either from the sensory or the motor patterns. It is 
possible that the motor patterns, being stabler and more 
complete, would provide a better starting point. Given 
the binary distinctions, identification could proceed 
either by correlation or by filtering. Essentially, the 
binary distinctions provide a way of simplifying the 
patterns on which the identifying mechanism acts. 
Since the distinctions are to some extent variable and 
since they contain less information than the more 
complete patterns from which they were derived, one 
would expect an identification process based on binary 
distinctions to make considerable use of conditional 
probability structure in deciding among alternatives 
not clearly eliminated by information carried by the 
current set of distinctions. 
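The role of conditional probability can be sketched as follows: the distinctions that were resolved eliminate some alternatives, and the probability of each remaining alternative, given what preceded it, decides among the rest. The three-element inventory, its reduction to two features, and the probability values are hypothetical and serve only to illustrate the procedure.

```python
def identify(observed, inventory, prob_given_previous):
    """Choose the most probable element consistent with the observed distinctions.

    observed maps a feature name to 0 or 1; features that could not be
    resolved are simply left out of the mapping.
    """
    candidates = [
        name for name, features in inventory.items()
        if all(features[k] == v for k, v in observed.items())
    ]
    if not candidates:
        return None, candidates
    best = max(candidates, key=lambda name: prob_given_previous.get(name, 0.0))
    return best, candidates


# Hypothetical, much simplified inventory.
inventory = {
    "s": {"voiced": 0, "fricative": 1},
    "z": {"voiced": 1, "fricative": 1},
    "t": {"voiced": 0, "fricative": 0},
}
# Assumed probabilities of each element given the preceding element.
prob_given_previous = {"s": 0.6, "z": 0.1, "t": 0.3}

# Only the fricative distinction was resolved; voicing is unknown.
best, candidates = identify({"fricative": 1}, inventory, prob_given_previous)
print(best, candidates)  # prints s ['s', 'z']
```

The observed distinction eliminates one alternative outright, and the conditional probabilities settle the choice between the two that remain.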


